The previous class session introduced decision trees as a fundamental algorithm used in machine learning. We also explored how spatially lagged attributes can be used as features in a predictive model as a means of accounting for or representing spillover effects or contagion processes. Recall that in a regression context, the presence of spatial autocorrelation in our data is problematic: something to be removed or, at the very least, a reason to treat any inferences with caution. On the other hand 👈 in a machine learning context, spatial autocorrelation is viewed as just another source of patterns in the data that we can harness to bolster our predictive accuracy 😃
This Assignment continues our foray into geospatial machine learning. The assigned readings for this week are available here and may be useful as you make your way through this activity:
First we load the necessary R packages and as usual, the necessary data have been placed in the data folder for you 🆒 As you read this week (and last week), the general aim of the two Steif chapters is to use observed home price data from Boston to train, apply, and evaluate a predictive model of home prices. However, we will pivot and follow a similar path using local Charlottesville data. We pull down the neighborhood boundaries layer from the Charlottesville open data portal, then apply a map projection to convert the data to a projected coordinate system. This Assignment also integrates data on reported crimes as a factor in predicting residential housing prices. The primary Charlottesville dataset is ready to go, but you should know that it does not contain all of the same variables as the Boston dataset from the readings. Instead, we have the following:
Eventually, we will add the following features as part of the Assignment:
The reported crimes data were taken from here and the remaining variables were pieced together by the instructor from Real Estate (Residential Details), Parcel Details 2, and Real Estate (Sales). Let’s see what we can do with these data!
# Get neighborhood polygons from the open data portal in GeoJSON format,
# then apply a map projection...
nhoods_utm_sf <-
  st_read("https://gisweb.charlottesville.org/cvgisweb/rest/services/OpenData_1/MapServer/97/query?outFields=*&where=1%3D1&f=geojson") %>%
  st_transform("EPSG:32617")
## Reading layer `OGRGeoJSON' from data source
## `https://gisweb.charlottesville.org/cvgisweb/rest/services/OpenData_1/MapServer/97/query?outFields=*&where=1%3D1&f=geojson'
## using driver `GeoJSON'
## Simple feature collection with 68 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -78.52375 ymin: 38.00959 xmax: -78.44651 ymax: 38.07049
## Geodetic CRS: WGS 84
cvilleHomes_utm <- st_read("./data/cvilleHomes_utm.geojson")
## Reading layer `cvilleHomes_utm' from data source
## `C:\Users\bw6xs\Documents\PLAN 6122\2023\Assignments\Lab Assignment 7\Lab Assignment 7\data\cvilleHomes_utm.geojson'
## using driver `GeoJSON'
## Simple feature collection with 4023 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 717392.7 ymin: 4209876 xmax: 723287.8 ymax: 4216385
## Projected CRS: WGS 84 / UTM zone 17N
# Collapse the many categories on the number of floors attribute, as we
# did last week so there is a more manageable number of categories...
cvilleHomes_utm_sf <-
  cvilleHomes_utm %>%
  mutate(NmbrOfS.collapse = case_when(
           NmbrOfS >= 0 & NmbrOfS < 3 ~ "Up to 2 Floors",
           NmbrOfS >= 3 & NmbrOfS < 4 ~ "3 Floors",
           NmbrOfS >= 4 ~ "4+ Floors"),          # >= 4 so homes with exactly 4 floors are not dropped
         NmbrOfS.cat = as_factor(NmbrOfS.collapse)) %>%
  filter(SalAmnt > 0.0,
         SqrFtFL > 0.0) %>%
  mutate_at(c("Bedroms", "FllBthr", "HlfBthr", "Fireplc"), as.numeric) %>%
  mutate_at(c("Heating", "BsmntTy", "ExtrnlW"), as_factor)
# Read in the reported crimes information...
cvilleCrimes <- st_read("./data/cvilleCrimes.shp")
## Reading layer `cvilleCrimes' from data source
## `C:\Users\bw6xs\Documents\PLAN 6122\2023\Assignments\Lab Assignment 7\Lab Assignment 7\data\cvilleCrimes.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 4188 features and 10 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -78.53321 ymin: 38.00525 xmax: -78.4303 ymax: 38.07422
## Geodetic CRS: WGS 84
cvilleCrimes_utm <- st_transform(cvilleCrimes, crs = "EPSG:32617")
# Extract only reports of aggravated assault, then get rid of observations
# with missing values and exact duplicates (the map projection was already
# applied above)...
cvilleAssaults_utm_sf <-
  cvilleCrimes_utm %>%
  filter(Offense == "Assault Aggravated") %>%
  drop_na() %>%
  distinct()
# Note that the LENGTHUNIT for our sf objects is meters...
st_crs(cvilleHomes_utm_sf)
## Coordinate Reference System:
## User input: WGS 84 / UTM zone 17N
## wkt:
## PROJCRS["WGS 84 / UTM zone 17N",
## BASEGEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]],
## CONVERSION["UTM zone 17N",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",-81,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.9996,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["Between 84°W and 78°W, northern hemisphere between equator and 84°N, onshore and offshore. Bahamas. Ecuador - north of equator. Canada - Nunavut; Ontario; Quebec. Cayman Islands. Colombia. Costa Rica. Cuba. Jamaica. Nicaragua. Panama. United States (USA)."],
## BBOX[0,-84,84,-78]],
## ID["EPSG",32617]]
# Create a quarter mile (402.336 meters) buffer around each housing unit
cvilleHomes_utm_quarter_mile_sf <- st_buffer(cvilleHomes_utm_sf, 402.336)
# Now create a new attribute in the sf object with the number of reported
# aggravated assaults within that buffer...
cvilleHomes_utm_sf <- cvilleHomes_utm_sf %>%
  mutate(NumAggAssaults = lengths(st_intersects(cvilleHomes_utm_quarter_mile_sf,
                                                cvilleAssaults_utm_sf)))
# Visualize the location of housing units sold, reported aggravated assaults,
# and the distance buffer we are using (i.e., in red)...
ggplot() +
  geom_sf(data = nhoods_utm_sf, aes(fill = "white"), color = "grey25") +
  geom_sf(data = cvilleAssaults_utm_sf, aes(col = "blue"), pch = 20, size = 2) +
  geom_sf(data = cvilleHomes_utm_quarter_mile_sf[c(89, 302), ], color = "red", fill = NA) +
  geom_sf(data = st_centroid(cvilleHomes_utm_sf), fill = NA, aes(color = "grey75"), pch = 20, size = 1.2) +
  scale_fill_manual(name = "Legend", values = c("white", "pink"), labels = c("Neighborhoods", "Example Buffer")) +
  scale_color_manual(name = "", values = c("blue", "grey75"), labels = c("Assaults", "Homes Sold")) +
  theme_void() +
  theme(text = element_text(size = 20))
## Warning in st_centroid.sf(cvilleHomes_utm_sf): st_centroid assumes attributes
## are constant over geometries of x
Note that in the map above, the red circles are examples of the one-quarter mile buffer used to count the number of reported aggravated assaults near each home in our dataset. In some cases the count is zero and in others we can see that there are multiple blue points.
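At bottom, the buffer-and-count operation above is just a distance-based point count. Here is a minimal base-R sketch of the same idea, using hypothetical planar coordinates (in meters, as in our projected CRS) rather than the assignment's sf objects:

```r
# Homes and crime events with hypothetical planar coordinates (meters)
homes  <- data.frame(x = c(0, 1000), y = c(0, 0))
events <- data.frame(x = c(100, 300, 900), y = c(0, 0, 0))

count_within <- function(hx, hy, ex, ey, radius) {
  # Euclidean distance from one home to every event, then count matches
  sum(sqrt((ex - hx)^2 + (ey - hy)^2) <= radius)
}

# One count per home, using the same quarter-mile radius as above
homes$n_events <- mapply(count_within, homes$x, homes$y,
                         MoreArgs = list(ex = events$x, ey = events$y,
                                         radius = 402.336))
homes$n_events  # first home has 2 events in range, the second has 1
```

The sf version does exactly this, but with real geometries and a spatial index doing the distance work.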
In the code chunk below, we use a new R package called sfdep to generate a spatial weights matrix, create a spatially lagged home price feature, and calculate spatial statistics such as the Global Moran’s I. A spatially lagged variable or feature is simply the weighted average of the sales prices of all homes that are considered “neighbors” of a particular home, and this approach is used frequently in spatial econometrics. If the spatial lag of sales prices is an important feature, that would suggest lots of clustering (e.g., Euclidean zoning, segregation based on race/ethnicity/income) in the distribution of residential real estate prices in our study area. This page provides a little more detail on spatial neighbors in R and even has an interactive app where you can explore how changing the way “neighboring” relationships are defined might affect the inferences we take away. If you need to use inverse distance weights, this page demonstrates how to generate them.
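Before we turn to sfdep, the mechanics of a spatially lagged feature can be sketched in a few lines of base R. This toy example uses hypothetical one-dimensional locations and prices, with k = 2 nearest neighbors and equal (row-standardized) weights:

```r
# Toy 1-D locations and sale prices (hypothetical values)
coords <- c(0, 1, 2, 10)
price  <- c(100, 110, 120, 500)
k <- 2

lag_price <- sapply(seq_along(coords), function(i) {
  d <- abs(coords - coords[i])   # distance to every other observation
  d[i] <- Inf                    # exclude the observation itself
  nb <- order(d)[1:k]            # indices of the k nearest neighbors
  mean(price[nb])                # equal (row-standardized) weights of 1/k
})
lag_price                        # even the isolated 4th home gets a lag value
```

This is the calculation that `sfdep::st_lag` performs for us, generalized to any neighbor definition and weighting scheme.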
# Create a neighbor list for homes sold based on its 5 nearest neighbors....
cvilleHomes_utm_centroids_sf <- st_centroid(cvilleHomes_utm_sf)
## Warning in st_centroid.sf(cvilleHomes_utm_sf): st_centroid assumes attributes
## are constant over geometries of x
cvilleHomes_knn <- sfdep::st_knn(cvilleHomes_utm_centroids_sf, k = 5)
# Create lagged version of sales price for each home sold that is the
# weighted average of the sales prices of its 5 nearest neighbors...
cvilleHomes_utm_sf <- cvilleHomes_utm_sf %>%
  mutate(lagSalAmnt = sfdep::st_lag(x = .$SalAmnt, nb = cvilleHomes_knn,
                                    wt = st_weights(cvilleHomes_knn, allow_zero = TRUE),
                                    allow_zero = TRUE))
# Plot sales price against lagged sales price, like the Steif chapter...
ggplot() +
  geom_point(data = cvilleHomes_utm_sf, mapping = aes(x = lagSalAmnt, y = SalAmnt),
             fill = "orange", color = "white", alpha = 0.25, pch = 21, size = 3, stroke = 1.2) +
  geom_smooth(data = cvilleHomes_utm_sf, method = "lm", mapping = aes(x = lagSalAmnt, y = SalAmnt),
              se = FALSE, color = "green4", alpha = 0.4) +
  labs(title = "Correlation between sale price \n and the spatial lag of sale price",
       y = "Sale Amount ($)", x = "Mean of 5 Nearest Home Sale Prices ($)") +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_x_continuous(labels = scales::dollar_format()) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        text = element_text(size = 20))
## `geom_smooth()` using formula = 'y ~ x'
# Say it another way....
print(str_c("Lagged sales price explains ",
            round(summary(lm(SalAmnt ~ lagSalAmnt,
                             data = cvilleHomes_utm_sf))$adj.r.squared * 100, 2),
            " percent of the variation in observed home sales prices."))
## [1] "Lagged sales price explains 84.78 percent of the variation in observed home sales prices."
# Test for presence of spatial autocorrelation in home sales prices...
cvilleHomes_moran <- sfdep::global_moran_test(x = cvilleHomes_utm_sf$SalAmnt, nb = cvilleHomes_knn,
wt = st_weights(cvilleHomes_knn, allow_zero = TRUE))
cvilleHomes_moran
##
## Moran I test under randomisation
##
## data: x
## weights: listw
##
## Moran I statistic standard deviate = 71.208, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.8565050992 -0.0004458315 0.0001448273
In the above code chunk, the scatterplot shows that as the average price of a home's nearest neighbors increases, so does the home's own sales price (and vice versa). Creating a spatially lagged feature is one approach for leveraging spatial autocorrelation in a predictive model like this, but we could also use neighborhood boundaries as a proxy for similar spillover effects 😨
That chunk also performs a Global Moran’s I test for spatial autocorrelation using the sfdep::global_moran_test function. The vanishingly small p-value indicates that the result is statistically significant, while the 0.85 value of the I statistic itself suggests the presence of strong, positive spatial autocorrelation (i.e., homes with high sales prices are likely to be located near other homes with high sales prices).
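The statistic itself is straightforward to compute by hand: with z the deviations of x from its mean, I = (n / S0) * sum over i,j of w_ij * z_i * z_j, divided by the sum of z_i^2, where S0 is the sum of all weights. A sketch with a toy, hypothetical neighbor matrix:

```r
# Hand computation of Global Moran's I on a toy 4-observation example
x <- c(10, 12, 30, 35)
W <- matrix(c(0, 1, 0, 0,
              1, 0, 0, 0,
              0, 0, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)  # hypothetical neighbor pairs

moran_i <- function(x, W) {
  n  <- length(x)
  z  <- x - mean(x)                 # deviations from the mean
  s0 <- sum(W)                      # sum of all weights
  (n / s0) * sum(W * outer(z, z)) / sum(z^2)
}
moran_i(x, W)  # close to +1: low values neighbor low values, high neighbor high
```

Here the two low-priced observations neighbor each other and the two high-priced ones do too, so the cross-products in the numerator are large and positive, just as with the Charlottesville sales prices.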
# Link neighborhood attributes to home sales dataset...
cvilleHomes_utm_sf <- st_join(cvilleHomes_utm_sf, nhoods_utm_sf, join = st_within, left = TRUE)
dim(cvilleHomes_utm_sf)
## [1] 2244 26
# Now filter out unwanted observations and make the neighborhood
# attribute a factor rather than a character type...
cvilleHomes_utm_sf <- cvilleHomes_utm_sf %>%
  filter(SalAmnt > 0.0,
         SqrFtFL > 0.0,
         !is.na(NeighHood)) %>%
  mutate_at("NeighHood", as_factor)
dim(cvilleHomes_utm_sf)
## [1] 2040 26
The code chunk below partitions (divides) the dataset into two pieces: a training dataset with approximately 75 percent of the observations and a test dataset with the remaining approximately 25 percent of the observations. It also stratifies on neighborhood name. We use the caret::trainControl function to set up the standard k-fold cross-validation routine with k = 10, but it could just as easily be 5 folds as illustrated in the image below.
As described in class, k-fold cross-validation can help us avoid overfitting and assist with tuning hyperparameters for a given machine learning algorithm. Cross-validation involves randomly dividing the data into k “slices” or folds of roughly the same size. The first fold is treated as a validation set and we train the model on the remainder of the data (i.e., the other 9 slices if k = 10). Next, we calculate the Mean Squared Error or another performance metric when the trained model is applied to the one fold that has been set aside. We repeat this process k times, with a different fold or “slice” acting as the validation set each time. This technique provides us with k estimates of the test error rate, as opposed to the single estimate we would otherwise have from a lone held-out set.
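The procedure just described can be sketched in base R without caret, which is worth seeing once before we let trainControl handle it. Toy data and a plain lm() stand in for our housing data and tree model:

```r
set.seed(42)
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- 3 * dat$x + rnorm(n, sd = 0.5)   # toy outcome with known noise

k <- 10
fold_id <- sample(rep(1:k, length.out = n))    # random fold assignment, 10 rows per fold

fold_rmse <- sapply(1:k, function(f) {
  fit  <- lm(y ~ x, data = dat[fold_id != f, ])        # train on the other k - 1 folds
  pred <- predict(fit, newdata = dat[fold_id == f, ])  # predict the held-out fold
  sqrt(mean((dat$y[fold_id == f] - pred)^2))           # fold-level RMSE
})
mean(fold_rmse)   # cross-validated estimate of test error
```

caret does the same bookkeeping for every candidate value of the tuning parameter and then picks the value with the best averaged metric.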
As mentioned in class, the caret package currently supports over 230 algorithms, and by setting the method argument of the caret::train function to “rpart”, we are fitting a simple decision tree model. Because the outcome we are predicting is continuous rather than categorical, it is a regression tree rather than a classification tree.
# This allows us to reproduce the results in the future...
set.seed(42)
#################################################################################
# This function is part of the caret package and the y argument outlines
# how to stratify the data. From the documentation "the sample is split
# into groups sections based on percentiles and sampling is done within
# these subgroups" and the paste function simply concatenates several features.
#
# This allows us to stratify on multiple features rather than just the outcome
# we are trying to predict (i.e., the training and test sets are more comparable)
#################################################################################
inTrain <- createDataPartition(
  y = cvilleHomes_utm_sf$NeighHood,
  p = 0.75, list = FALSE)
cville_training <- cvilleHomes_utm_sf[inTrain, ]
cville_testing  <- cvilleHomes_utm_sf[-inTrain, ]
cville_training %>%
  st_drop_geometry() %>%
  rstatix::get_summary_stats()
cville_testing %>%
  st_drop_geometry() %>%
  rstatix::get_summary_stats()
# Set up for standard, 10 fold cross-validation...
tree_ctrl <- trainControl(method = "cv", number = 10)
tree_model <- train(SalAmnt ~ SqrFtFL + Bedroms + FllBthr + Age + LotAcrs + Fireplc + NmbrOfS.cat + NumAggAssaults,
                    data = cville_training %>%
                      st_drop_geometry() %>%
                      drop_na() %>%
                      dplyr::select(SalAmnt, SqrFtFL, Bedroms, FllBthr, Age, LotAcrs, Fireplc,
                                    NmbrOfS.cat, NumAggAssaults, PrclNmb),
                    method = "rpart",
                    trControl = tree_ctrl,
                    tuneLength = 10)
print(tree_model)
## CART
##
## 1233 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1109, 1110, 1111, 1109, 1110, 1110, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01209965 177914.7 0.6468659 118296.8
## 0.01425456 181864.8 0.6317297 120806.1
## 0.01492998 182582.1 0.6298142 120973.5
## 0.01934665 188500.5 0.6087911 124453.7
## 0.02263863 192809.3 0.5919239 127046.6
## 0.03178573 201489.9 0.5559143 132747.1
## 0.05747887 216585.0 0.5021451 138916.8
## 0.08552746 227052.4 0.4427115 145874.7
## 0.12130273 248144.8 0.3389001 164249.3
## 0.31859925 292737.6 0.2052107 188593.8
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01209965.
# Note the scale argument defaults to TRUE and the importance values are
# normalized to range between 0 and 100...
caret::varImp(tree_model)
## rpart variable importance
##
## Overall
## SqrFtFL 100.00
## LotAcrs 89.89
## Age 85.86
## Bedroms 79.13
## Fireplc 73.56
## FllBthr 41.09
## NumAggAssaults 26.02
## `NmbrOfS.cat4+ Floors` 0.00
## `NmbrOfS.cat3 Floors` 0.00
# We can avoid this normalization by setting the scale argument to FALSE
tree_model_importance <- caret::varImp(tree_model, scale = FALSE)
# We can also visualize the most important features in our model this way...
plot(tree_model_importance, top = 5)
In the above chunk, we introduce the caret::varImp function, which provides insight into which of the features we have included in the model are most important for making the resulting predictions. Without getting too far into the weeds, 🌲 for a single regression tree the importance score reflects how much each feature improves the splitting criterion, summed across the splits in which it appears; for many other algorithms, importance is instead estimated by permuting the values of a given feature and noting how much the prediction error changes on average1.
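As a sketch of that model-agnostic permutation idea (not the tree-specific scoring caret applies to rpart), here is permutation importance computed by hand on a toy linear model, where x1 is built to matter far more than x2:

```r
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 + 0.1 * dat$x2 + rnorm(n, sd = 0.3)  # x1 dominates by design

fit <- lm(y ~ x1 + x2, data = dat)
base_rmse <- sqrt(mean((dat$y - predict(fit, dat))^2))

perm_importance <- function(feature) {
  shuffled <- dat
  shuffled[[feature]] <- sample(shuffled[[feature]])  # break the feature-outcome link
  # importance = how much the error worsens once the feature is scrambled
  sqrt(mean((dat$y - predict(fit, shuffled))^2)) - base_rmse
}

imp <- sapply(c("x1", "x2"), perm_importance)
imp   # x1 should show a much larger error increase than x2
```

A feature whose shuffling barely moves the error was not contributing much to the predictions in the first place.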
We fit the model on the training dataset in an earlier code chunk; in the code chunk below, we apply it to the test dataset using the predict function. The MAE and the Absolute Percent Error are then calculated based on the known sales prices AND the predicted sales prices from the model.
cville_testing_no_geometry <-
  cville_testing %>%
  st_drop_geometry() %>%
  drop_na() %>%
  dplyr::select(SalAmnt, SqrFtFL, Bedroms, FllBthr, Age, LotAcrs, Fireplc,
                NmbrOfS.cat, NumAggAssaults, PrclNmb)
tree_model_testing <- predict(tree_model, cville_testing_no_geometry)
outputPredictions <- cville_testing_no_geometry %>%
  mutate(
    SalePrice_AbsError = abs(tree_model_testing - cville_testing_no_geometry$SalAmnt),
    SalePrice_APE = ((abs(tree_model_testing - cville_testing_no_geometry$SalAmnt)) /
                       tree_model_testing) * 100) %>%
  filter(SalAmnt < 5000000)
# Assess accuracy...
mean(outputPredictions$SalePrice_AbsError, na.rm = T)
## [1] 112276.6
mean(outputPredictions$SalePrice_APE, na.rm = T)
## [1] 23.44235
# Map model performance...
cville_testing_to_plot <- left_join(cville_testing,
                                    outputPredictions %>%
                                      select(PrclNmb, SalePrice_AbsError, SalePrice_APE),
                                    by = "PrclNmb")
tmap_mode("view")
## tmap mode set to interactive viewing
cville_testing_to_plot <- st_make_valid(cville_testing_to_plot) %>%
drop_na(SalePrice_AbsError)
tm_shape(nhoods_utm_sf, name = "Neighborhoods") +
  tm_borders(col = "dodgerblue") +
  tm_shape(cville_testing_to_plot, "Prediction Error of Simple Regression Tree Model") +
  tm_dots(shape = 21, col = "SalePrice_AbsError", palette = "viridis",
          style = "quantile", colorNA = "orange", textNA = "Unknown", title = "Absolute Error ($)",
          popup.vars = c("Sale Amount: " = "SalAmnt", "Square Footage: " = "SqrFtFL",
                         "Lot (Acres): " = "LotAcrs", "% Prediction Error: " = "SalePrice_APE"),
          legend.format = list(fun = function(x) paste0("$", formatC(x, digits = 0, big.mark = ',', format = "f"))),
          id = "PrclNmb") +
  tmap_options(check.and.fix = TRUE) +
  tm_view(set.view = c(-78.4837122981158, 38.03437353596851, 13)) +
  tm_basemap(c("Esri.WorldGrayCanvas", "CartoDB.Positron", "Esri.WorldTopoMap", "CartoDB.DarkMatter"))
These metrics and the interactive map give us a sense of how well the model performs when applied to new data. Celebrate! 🥳
fancyRpartPlot(tree_model$finalModel, main = "Simple Decision Tree Model", sub = "Charlottesville Housing Sales", digits = -3)
The code chunk above uses a new R package called rattle to visually display the decision tree created when we trained our model. Recall that decision trees are “grown” and read from the top to the bottom. Although the cville_training set contains 1546 observations, the model is fit using 1233 of these due to filtering of missing data values in a prior chunk. We can see in the figure that the first split is based on the SqrFtFL feature (i.e., the square footage of the home) and branches to the left if an observation is less than 2,816 square feet and to the right otherwise. From there, splits are based on other important features like FllBthr (i.e., number of full bathrooms), LotAcrs (i.e., size of the lot), and Age (i.e., how old the home is).
At each internal node (i.e., between the root node and the leaves or terminal nodes), the plot displays the number and percentage of observations that “flow” through that node of the decision tree, as well as what the model would predict as the sales price.
In this section of the Assignment, we train a model using the random forest algorithm from the randomForest package by setting the method argument of the caret::train function to “rf” rather than “rpart” (which trains a simple regression tree model). So what distinguishes the random forest algorithm from a simple decision tree algorithm? Before we can answer that question, we have to introduce the concept of bagging (more information in the next class session):
Bootstrap aggregating, or bagging, “creates b new bootstrap samples by drawing samples with replacement of the original training data.” Bagging is one of the first ensemble algorithms machine learning practitioners learn and is designed to improve the stability and accuracy of regression and classification algorithms. By model averaging, bagging helps to reduce variance and minimize overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method.
A random forest is a special type of bagged ensemble model where a large number of individual decision trees (e.g., hundreds or thousands) are generated, but these individual trees are de-correlated. This means that each time a split is to be performed, the search for the split variable is limited to a random subset of mtry features (mtry is one of the algorithm’s hyperparameters) drawn from the total set of features available.
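The bagging recipe above can be sketched in base R: a deliberately high-variance one-split “stump” as the base learner, B bootstrap samples drawn with replacement, and the B predictions averaged. The data and names here are illustrative, not part of the assignment:

```r
set.seed(42)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)   # toy nonlinear signal plus noise

# base learner: a one-split regression "stump" whose split minimizes SSE
stump_predict <- function(xtr, ytr, xnew) {
  cuts <- quantile(xtr, probs = seq(0.1, 0.9, 0.1))
  sse  <- sapply(cuts, function(ct)
    sum((ytr[xtr <= ct] - mean(ytr[xtr <= ct]))^2) +
    sum((ytr[xtr >  ct] - mean(ytr[xtr >  ct]))^2))
  best <- cuts[which.min(sse)]
  ifelse(xnew <= best, mean(ytr[xtr <= best]), mean(ytr[xtr > best]))
}

B <- 100                                    # number of bootstrap samples
xgrid <- seq(0, 1, length.out = 50)
preds <- replicate(B, {
  idx <- sample(n, replace = TRUE)          # draw a bootstrap sample
  stump_predict(x[idx], y[idx], xgrid)      # fit and predict with one stump
})
bagged <- rowMeans(preds)                   # aggregate: average the B predictions
```

A random forest adds one twist to this recipe: at each split, the candidate features are restricted to a random subset of size mtry, which de-correlates the trees.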
So why is the random forest algorithm popular?
# Set up for standard, 10 fold cross-validation...
tree_ctrl <- trainControl(method = "cv", number = 10)
rf_model <- train(SalAmnt ~ SqrFtFL + Bedroms + FllBthr + Age + LotAcrs + Fireplc + NmbrOfS.cat + NumAggAssaults,
                  data = cville_training %>%
                    st_drop_geometry() %>%
                    drop_na() %>%
                    dplyr::select(SalAmnt, SqrFtFL, Bedroms, FllBthr, Age, LotAcrs, Fireplc,
                                  NmbrOfS.cat, NumAggAssaults, PrclNmb, NeighHood, lagSalAmnt),
                  method = "rf",
                  trControl = tree_ctrl,
                  tuneLength = 10)
## note: only 8 unique complexity parameters in default grid. Truncating the grid to 8 .
print(rf_model)
## Random Forest
##
## 1233 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1110, 1110, 1109, 1110, 1110, 1110, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 118092.25 0.8785299 76788.00
## 3 90849.46 0.9133282 52633.81
## 4 87215.62 0.9182268 47169.80
## 5 86392.53 0.9192250 45268.06
## 6 87074.39 0.9176254 44811.01
## 7 86865.48 0.9180409 44764.84
## 8 87816.69 0.9161499 44822.99
## 9 89866.89 0.9121350 45509.27
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
varImp(rf_model, scale = FALSE)
## rf variable importance
##
## Overall
## SqrFtFL 4.183e+13
## Age 2.085e+13
## LotAcrs 1.622e+13
## FllBthr 1.116e+13
## Fireplc 8.941e+12
## Bedroms 7.819e+12
## NumAggAssaults 4.168e+12
## NmbrOfS.cat4+ Floors 0.000e+00
## NmbrOfS.cat3 Floors 0.000e+00
rf_model_importance <- varImp(rf_model, scale = TRUE)
plot(rf_model_importance, top = 5)
You should also recall that caret supports over 230 machine learning algorithms and there are multiple variations on the random forest algorithm. As shown in the documentation, there are multiple options for the method argument of the caret::train function that train a random forest, but the one we have selected here has a single hyperparameter: it is called mtry and represents the number of features to consider at each split point. This means that values of mtry may range between 1 and the total number of features in your model. Another approach would be to use method = 'ranger' to train a random forest using the ranger package, which allows us to tune two additional hyperparameters, splitrule and min.node.size. You can find more information on these two hyperparameters by referring to the documentation of the R package that caret is calling in the background, which is ranger in that alternate case but is randomForest in our present example.
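If you wanted to control the search yourself rather than rely on tuneLength, caret::train also accepts an explicit tuneGrid. A sketch of what such a grid might look like for method = 'ranger' (the column names follow the three hyperparameters discussed above; the specific candidate values are illustrative):

```r
# One row per candidate hyperparameter combination to be cross-validated
ranger_grid <- expand.grid(
  mtry          = 2:6,                          # features tried at each split
  splitrule     = c("variance", "extratrees"),  # regression split rules in ranger
  min.node.size = c(5, 10)                      # minimum observations per terminal node
)
nrow(ranger_grid)   # 5 x 2 x 2 = 20 candidate combinations
# usage (not run here): train(..., method = "ranger", tuneGrid = ranger_grid)
```

Each row of the grid gets its own full round of 10-fold cross-validation, so larger grids cost proportionally more compute time.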
cville_testing_no_geometry <-
  cville_testing %>%
  st_drop_geometry() %>%
  drop_na() %>%
  dplyr::select(SalAmnt, SqrFtFL, Bedroms, FllBthr, Age, LotAcrs, Fireplc,
                NmbrOfS.cat, NumAggAssaults, PrclNmb, NeighHood, lagSalAmnt)
rf_model_testing <- predict(rf_model, cville_testing_no_geometry)
outputPredictions_rf <- cville_testing_no_geometry %>%
  mutate(
    SalePrice_AbsError = abs(rf_model_testing - cville_testing_no_geometry$SalAmnt),
    SalePrice_APE = ((abs(rf_model_testing - cville_testing_no_geometry$SalAmnt)) /
                       rf_model_testing) * 100) %>%
  filter(SalAmnt < 5000000)
# Assess accuracy...
mean(outputPredictions_rf$SalePrice_AbsError, na.rm = T)
## [1] 38358.15
mean(outputPredictions_rf$SalePrice_APE, na.rm = T)
## [1] 9.123643
# Create interactive map of RF model errors...
cville_testing_rf_to_plot <- left_join(cville_testing,
                                       outputPredictions_rf %>%
                                         select(PrclNmb, SalePrice_AbsError, SalePrice_APE),
                                       by = "PrclNmb")
tmap_mode("view")
## tmap mode set to interactive viewing
cville_testing_rf_to_plot <- st_make_valid(cville_testing_rf_to_plot) %>%
drop_na(SalePrice_AbsError)
tm_shape(nhoods_utm_sf, name = "Neighborhoods") +
  tm_borders(col = "dodgerblue") +
  tm_shape(cville_testing_rf_to_plot, "Prediction Error of Random Forest Model") +
  tm_dots(shape = 21, col = "SalePrice_AbsError", palette = "inferno",
          style = "quantile", colorNA = "orange", textNA = "Unknown", title = "Absolute Error ($)",
          popup.vars = c("Sale Amount: " = "SalAmnt", "Square Footage: " = "SqrFtFL",
                         "Lot (Acres): " = "LotAcrs", "% Prediction Error: " = "SalePrice_APE"),
          legend.format = list(fun = function(x) paste0("$", formatC(x, digits = 0, big.mark = ',', format = "f"))),
          id = "PrclNmb") +
  tmap_options(check.and.fix = TRUE) +
  tm_view(set.view = c(-78.4837122981158, 38.03437353596851, 13)) +
  tm_basemap(c("Esri.WorldGrayCanvas", "CartoDB.Positron", "Esri.WorldTopoMap", "CartoDB.DarkMatter"))
Sometimes real estate characteristics are constant across properties in the same neighborhood (e.g., access to a common swimming pool or walking trails). The code chunk below uses the name of the neighborhood as a proxy for all of those unobserved characteristics that vary across neighborhoods in Charlottesville, but that are the same for homes located within a given neighborhood. Let’s use the rstatix package to quickly get a sense of how well the random forest model performed across neighborhoods in Charlottesville. 💪
kbl(cville_testing_rf_to_plot %>%
      st_drop_geometry() %>%
      group_by(NeighHood) %>%
      rstatix::get_summary_stats(SalePrice_AbsError),
    digits = 3, align = "lccrr",
    caption = "Observed Versus Predicted Sales Price by Neighborhood") %>%
  kable_paper("hover", full_width = TRUE)
| NeighHood | variable | n | min | max | median | q1 | q3 | iqr | mad | mean | sd | se | ci |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Greenbrier | SalePrice_AbsError | 17 | 249.356 | 70381.808 | 39800.595 | 2284.537 | 43738.660 | 41454.123 | 45339.706 | 30916.427 | 28581.233 | 6931.967 | 14695.114 |
| Rutledge | SalePrice_AbsError | 36 | 379.542 | 66471.315 | 15046.934 | 5356.201 | 60220.082 | 54863.881 | 21745.875 | 29657.790 | 27136.283 | 4522.714 | 9181.597 |
| Angus Road Area | SalePrice_AbsError | 5 | 6622.957 | 10776.871 | 6622.957 | 6622.957 | 6622.957 | 0.000 | 0.000 | 7453.740 | 1857.687 | 830.783 | 2306.623 |
| Meadowbrook Hills | SalePrice_AbsError | 21 | 1345.967 | 31338.827 | 5184.833 | 5184.833 | 5184.833 | 0.000 | 0.000 | 6415.810 | 5828.575 | 1271.899 | 2653.136 |
| Greenleaf Terrace/Rose Hill/Rugby Hills | SalePrice_AbsError | 42 | 1215.864 | 51243.833 | 22816.076 | 3951.324 | 46074.093 | 42122.768 | 29011.650 | 24444.332 | 18450.504 | 2846.975 | 5749.582 |
| Woodhayven | SalePrice_AbsError | 1 | 9807.116 | 9807.116 | 9807.116 | 9807.116 | 9807.116 | 0.000 | 0.000 | 9807.116 | NA | NA | NaN |
| Rugby | SalePrice_AbsError | 27 | 195.788 | 169282.479 | 17813.261 | 195.788 | 59469.325 | 59273.537 | 26119.665 | 42756.975 | 58181.255 | 11196.988 | 23015.738 |
| Holmes/North Avenue Area | SalePrice_AbsError | 8 | 6934.634 | 36678.409 | 21806.522 | 6934.634 | 36678.409 | 29743.774 | 22049.060 | 21806.522 | 15898.716 | 5621.045 | 13291.659 |
| St. Charles Place | SalePrice_AbsError | 2 | 10835.527 | 11598.529 | 11217.028 | 11026.277 | 11407.779 | 381.501 | 565.614 | 11217.028 | 539.524 | 381.501 | 4847.432 |
| Locust Grove Extended | SalePrice_AbsError | 24 | 604.147 | 73833.170 | 1873.459 | 604.147 | 66583.415 | 65979.268 | 1881.883 | 27479.523 | 34287.945 | 6998.998 | 14478.529 |
| Davis Avenue/Marshall Street Area | SalePrice_AbsError | 4 | 59020.924 | 59020.924 | 59020.924 | 59020.924 | 59020.924 | 0.000 | 0.000 | 59020.924 | 0.000 | 0.000 | 0.000 |
| Towles/Merryden/Ivy Terrace | SalePrice_AbsError | 11 | 2312.841 | 63195.617 | 40560.267 | 19256.168 | 54985.717 | 35729.548 | 31585.456 | 36639.660 | 21104.442 | 6363.229 | 14178.157 |
| North Downtown | SalePrice_AbsError | 43 | 1738.105 | 529309.598 | 33638.261 | 6776.798 | 58632.390 | 51855.593 | 37661.666 | 89877.056 | 152930.049 | 23321.614 | 47064.922 |
| Rose Hill/Forrest Street | SalePrice_AbsError | 9 | 6366.080 | 6366.080 | 6366.080 | 6366.080 | 6366.080 | 0.000 | 0.000 | 6366.080 | 0.000 | 0.000 | 0.000 |
| Venable | SalePrice_AbsError | 2 | 10026.958 | 74985.479 | 42506.219 | 26266.589 | 58745.849 | 32479.260 | 48153.751 | 42506.219 | 45932.610 | 32479.260 | 412688.130 |
| Venable/Page/10th Street | SalePrice_AbsError | 12 | 6275.865 | 89731.829 | 19860.268 | 6275.865 | 89731.829 | 83455.964 | 20140.235 | 38472.386 | 38418.777 | 11090.546 | 24410.126 |
| Locust Grove | SalePrice_AbsError | 16 | 17562.550 | 202033.425 | 17562.550 | 17562.550 | 80110.065 | 62547.515 | 0.000 | 65891.669 | 81405.432 | 20351.358 | 43377.893 |
| Court Square/Central Business District | SalePrice_AbsError | 1 | 10886.233 | 10886.233 | 10886.233 | 10886.233 | 10886.233 | 0.000 | 0.000 | 10886.233 | NA | NA | NaN |
| Golf Club | SalePrice_AbsError | 18 | 344.225 | 72068.362 | 20476.032 | 2186.013 | 25155.130 | 22969.117 | 26065.951 | 25111.585 | 27359.326 | 6448.655 | 13605.473 |
| Little High Street/East Jefferson Street | SalePrice_AbsError | 4 | 610.215 | 610.215 | 610.215 | 610.215 | 610.215 | 0.000 | 0.000 | 610.215 | 0.000 | 0.000 | 0.000 |
| Fife Estate | SalePrice_AbsError | 24 | 563.092 | 75023.763 | 12686.934 | 6534.271 | 12686.934 | 6152.663 | 9121.938 | 20111.892 | 25642.142 | 5234.180 | 10827.727 |
| University/Maury Hills | SalePrice_AbsError | 3 | 858.457 | 95073.172 | 28485.376 | 14671.917 | 61779.274 | 47107.357 | 40959.670 | 41472.335 | 48431.384 | 27961.872 | 120310.227 |
| Forest Hills | SalePrice_AbsError | 10 | 1075.722 | 23398.777 | 17388.323 | 5043.669 | 23398.777 | 18355.108 | 8911.099 | 14474.764 | 9697.229 | 3066.533 | 6936.979 |
| JPA/Shamrock Road | SalePrice_AbsError | 53 | 221.800 | 58426.491 | 7311.989 | 3688.137 | 17811.894 | 14123.758 | 5372.722 | 16079.507 | 18134.350 | 2490.944 | 4998.446 |
| North Belmont | SalePrice_AbsError | 103 | 39.830 | 45543.778 | 39.830 | 39.830 | 39.830 | 0.000 | 0.000 | 5420.879 | 13827.652 | 1362.479 | 2702.471 |
| Ix/Belmont | SalePrice_AbsError | 182 | 32.933 | 8799.333 | 33.673 | 32.933 | 33.673 | 0.740 | 0.549 | 313.731 | 1226.083 | 90.883 | 179.327 |
| Johnson Village | SalePrice_AbsError | 4 | 7613.389 | 21283.916 | 18509.295 | 15198.487 | 19789.782 | 4591.295 | 2636.851 | 16478.974 | 6087.005 | 3043.503 | 9685.784 |
| Carlton/Belmont | SalePrice_AbsError | 29 | 9934.164 | 69899.817 | 53100.183 | 23794.371 | 69899.817 | 46105.446 | 24907.138 | 46896.307 | 23740.213 | 4408.447 | 9030.294 |
| Burnet Commons | SalePrice_AbsError | 10 | 4878.703 | 59167.013 | 12039.285 | 12039.285 | 59167.013 | 47127.728 | 10014.081 | 29539.495 | 25632.209 | 8105.616 | 18336.178 |
| Belmont | SalePrice_AbsError | 88 | 83.357 | 257055.491 | 98948.683 | 23443.922 | 131051.317 | 107607.395 | 107965.155 | 110024.178 | 91335.858 | 9736.435 | 19352.219 |
| Ridge Street | SalePrice_AbsError | 18 | 770.982 | 94259.708 | 75740.292 | 62278.215 | 94259.708 | 31981.493 | 27456.887 | 65665.669 | 30569.749 | 7205.359 | 15201.978 |
| Huntley | SalePrice_AbsError | 4 | 6864.159 | 17590.206 | 13034.925 | 11044.744 | 14621.234 | 3576.490 | 3819.129 | 12631.054 | 4430.518 | 2215.259 | 7049.944 |
| Fry’s Spring | SalePrice_AbsError | 30 | 2429.874 | 34684.884 | 34176.672 | 7206.258 | 34176.672 | 26970.414 | 0.000 | 23094.293 | 14100.827 | 2574.447 | 5265.335 |
| Azalea Gardens/Green Valley | SalePrice_AbsError | 18 | 332.018 | 23607.881 | 2090.738 | 516.071 | 8331.971 | 7815.900 | 2607.479 | 4577.979 | 5932.374 | 1398.274 | 2950.100 |
| Brookwood | SalePrice_AbsError | 13 | 132.871 | 43844.981 | 17734.470 | 132.871 | 43844.981 | 43712.110 | 26096.131 | 19233.632 | 18832.531 | 5223.204 | 11380.385 |
| Willoughby | SalePrice_AbsError | 1 | 9863.590 | 9863.590 | 9863.590 | 9863.590 | 9863.590 | 0.000 | 0.000 | 9863.590 | NA | NA | NaN |
| Lochlyn Hill | SalePrice_AbsError | 68 | 1131.164 | 356479.208 | 70665.948 | 54334.052 | 291266.651 | 236932.599 | 86355.858 | 151871.649 | 131158.400 | 15905.292 | 31747.096 |
| Stonehenge Extended | SalePrice_AbsError | 94 | 3898.536 | 25350.974 | 14320.451 | 12089.026 | 25350.974 | 13261.948 | 3474.030 | 17435.348 | 6021.983 | 621.120 | 1233.421 |
The above code chunk enlists the kableExtra::kbl
function to gently format a table for us. You can learn more about this
function here, but I have not found table formatting to be a
particularly useful place to channel effort. Packages like
formattable and pixiedust are reviewed at the links
below if you want to take a deeper dive into table formatting:
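As a minimal sketch of how kableExtra::kbl is typically paired with a styling call, here is a self-contained example using the built-in mtcars data (the column choices and caption are illustrative, not part of the Assignment):

```r
# A quick kableExtra sketch: kbl() builds the table, kable_styling() dresses it up
library(kableExtra)

head(mtcars) %>%
  kbl(digits = 1, caption = "First six rows of mtcars (illustrative)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

In a knitted HTML document, the styled table renders in place of the chunk output.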
As outlined in greater detail here, the neighborhood effects model not only fits our data better; it is also likely to perform better when it “sees new data,” and its accuracy is more consistent across neighborhoods. This speaks directly to Steif’s definition of generalizability, which consists of:
“… the ability to predict accurately on new data - which was the focus of cross-validation in Chapter 3” and “…the ability to predict with comparable accuracy across different group contexts, like neighborhoods” otherwise, “…the algorithm may not be fair” and frankly not very useful…
Recall the plots of RMSE and R-squared across the k-fold cross-validation from Lab Exercise #6, which we can access via
rf_model$resample. Those plots speak to the first aspect of
generalizability as articulated here—whether the model
performs at a comparable level when it “sees new data” that have not
been used to train and test it. The table displayed here instead
addresses the second aspect of generalizability and
provides insight into whether the model’s accuracy is consistent across
neighborhoods.
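To make the first aspect of generalizability concrete, the sketch below summarizes per-fold metrics the way you would from a caret model’s $resample slot. Note that rf_model here is mocked with made-up fold values so the code runs on its own; in the Lab you would use the actual fitted object:

```r
# Sketch: summarizing per-fold CV metrics. The $resample slot of a caret
# model holds one row per fold; low spread across folds suggests the model
# performs consistently when it "sees new data". The values below are
# fabricated stand-ins for the real rf_model from Lab Exercise #6.
rf_model <- list(
  resample = data.frame(
    RMSE     = c(41000, 39500, 45200, 38800, 42100),
    Rsquared = c(0.71, 0.74, 0.66, 0.75, 0.70),
    Resample = paste0("Fold", 1:5)
  )
)

# Mean and standard deviation of each metric across folds
sapply(rf_model$resample[c("RMSE", "Rsquared")],
       function(x) c(mean = mean(x), sd = sd(x)))
```

A small sd relative to the mean indicates consistent out-of-fold performance.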
When you have read and understand what the preceding code chunks are doing, please proceed… 👑
This exercise asks you to experiment with two features that we
created but have not yet included in our
models—NeighHood and lagSalAmnt. Drawing
on the code in the three preceding code chunks, write your own code
that trains and tests a random forest model that includes
NeighHood and an alternative that includes
lagSalAmnt, then respond to the questions posed
below:
Running
DataExplorer::plot_correlation(cville_training %>% drop_na() %>% select(SalAmnt, SqrFtFL, Bedroms, FllBthr, Age, Fireplc, LotAcrs, NumAggAssaults)),
we see that the presence of a fireplace is highly correlated with sales
price AND with the square footage of the home. Sometimes we
can achieve addition through subtraction 🤣 so let’s see if this is one
of those times!
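One way to check whether “subtraction” helps is to fit the model with and without the collinear feature and compare errors. The sketch below uses simulated data with the course’s variable names (cville_training is not loaded here) and lm() in place of the random forest so it runs quickly; the comparison logic is what carries over:

```r
# Sketch of "addition through subtraction": fit with and without the
# collinear Fireplc feature and compare errors. Toy data only -- Fireplc
# is constructed to be correlated with SqrFtFL, mimicking the pattern in
# the correlation plot.
set.seed(42)
n <- 200
SqrFtFL <- runif(n, 800, 3500)
Fireplc <- as.integer(SqrFtFL + rnorm(n, sd = 300) > 2000)  # tied to home size
SalAmnt <- 100 * SqrFtFL + 5000 * Fireplc + rnorm(n, sd = 20000)
toy <- data.frame(SalAmnt, SqrFtFL, Fireplc)

full    <- lm(SalAmnt ~ SqrFtFL + Fireplc, data = toy)
reduced <- lm(SalAmnt ~ SqrFtFL, data = toy)

# In-sample RMSE for each model; with the real data you would compare
# cross-validated RMSE instead, since in-sample error always favors
# the larger model
rmse <- function(m) sqrt(mean(residuals(m)^2))
c(full = rmse(full), reduced = rmse(reduced))
```

If cross-validated error barely moves (or improves) after dropping the redundant feature, the simpler model is the better choice.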
Please submit an R Notebook and knitted HTML file that shows your work and responses for each of the Exercises included in this Assignment. Also, briefly comment on your experience with R this week (i.e., provide the requested Reflective Content described in the rubric below). Please upload your report to Collab by 5:00 pm on Friday, April 21st.
This Lab Exercise will be graded on a 100-point scale according to the rubric below:
Length and formatting (10 pts)
Clarity of writing and attention to detail (20 pts)
Technical Content (45 pts)
Reflective Content (25 pts)
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.↩︎